ABSTRACT

The challenge of "Big Data" extends beyond the simple storage
and management of large volumes of data. Science that involves
and depends upon large amounts of data also requires overcoming
challenges in multiple other areas, including managing
large-scale data distribution as well as co-placement and
scheduling with computing resources. Although there exist
multiple approaches to addressing each of these challenges, an
integrative approach is missing; furthermore, extending
existing functionality or enabling cross-implementation
capability for existing implementations remains difficult at
best. To address the fundamental challenges of co-placement
and scheduling of data and compute in heterogeneous and
distributed environments with interoperability and
extensibility as first-order concerns, we define the concept
of Pilot-Data, in analogy with, and symmetrical to, Pilot-Jobs.
In this paper, we design and implement a Pilot-Data prototype,
and deploy it on multiple production distributed
cyberinfrastructure. We validate the concept of Pilot-Data by
establishing that it provides a simple abstraction for managing
data placement, whilst supporting interoperability across
distributed environments and late-binding. Our experiments
characterize the performance of an implementation of the
Pilot-Data concept. They show a level of performance that is
suitable for most large-scale scientific applications on
(production) distributed cyberinfrastructure. We demonstrate
how the concept of Pilot-Data also provides the basis upon
which to build tools and support capabilities like affinity,
which in turn can be used for advanced data-compute
co-placement and scheduling.

1. INTRODUCTION 

Data has become a critical factor in many science disciplines
[1]. As a consequence, the data generated by scientific
applications, instruments and sensors is experiencing an
exponential growth in volume, complexity and scale of
distribution. The ability to analyze prodigious volumes of data
requires flexible and novel ways to manage distributed data
and computations. Furthermore, analytical insight is
increasingly dependent on integrating different and distributed
data sources, computational methods and resources [2].

Working with large volumes of data involves many challenges
beyond its instantiation, storage and management. The specific
challenge we address is that of co-placement and scheduling of
data and compute in a distributed environment. The difficulty
of managing co-placement and scheduling effectively and
reliably is in part due to the challenges inherent in
distributed environments, compounded by an increasingly rich
but incompatible and heterogeneous data-cyberinfrastructure
incorporating diverse storage systems, data management
semantics and multiple transfer protocols/approaches. Although
these challenges have existed for a while, they are having
progressively larger impacts on the performance and
scalability of scientific applications; in particular, most
currently available scientific applications still operate in
legacy modes, e.g., they often require manual data management
(such as the stage-in and stage-out of files) and scheduling.

Some of the questions that arise include: (i) How can the
placement, scheduling and distribution of data be managed
effectively so that data is available when needed by a
computational task? (ii) What are the right abstractions for
coupling compute and data that hold for a range of application
types and infrastructures? Furthermore, how can both
system-level and application-level information be incorporated
to support dynamic and late binding of compute and data to
resources? (iii) Last but not least, how can the inherent
heterogeneity be handled? How can one provide interoperable,
uniform access to these heterogeneous distributed
data-cyberinfrastructures and resources?

These challenges are common to many scientific applications.
In this paper we focus on two well-known exemplars: climate
modeling as performed by the Earth System Grid Federation
(ESGF) and Next-Generation Sequencing (NGS) analysis
applications [3]. Before turning to these, we note that there
exist multiple other challenges, viz., data security, data
access rights and policy, and data semantics and consistency.
These are all important determinants of the ultimate usability
and usage modes, but we consider them out of scope for this
paper. Our decision is in part explained by the fact that our
work is ultimately aimed towards the development of
abstractions and middleware for production distributed
cyberinfrastructure (DCI), which will be agnostic to specific
security and data-sharing policies.

Application Exemplars 

ESGF: The overall data generated and stored is 2-10 PB, of
which the most frequently used data is 1-2 PB in size. The
data is generated by a distributed set of climate centers, and
it is stored in a distributed set of federated archives. The
data is used by a distributed set of users, who either run data
analyses at a climate center with which they are associated,
or gather data from the ESGF onto a local system for their
analyses. Furthermore, data generated over time causes
real-time changes, both spatial and temporal, in the ESGF
dataset; the scheduling of data analysis jobs needs to be
responsive to these spatio-temporal data changes.
NGS analytics: In addition to being a big data problem
(O(Terabytes)), NGS analytics is also a computationally
demanding and distributed computing problem. The computational
demands arise from the often complex and intensive analysis
that has to be performed on data, which in turn arises from
algorithms that are designed to account for repetitions,
errors and incomplete information. The distributed aspects
arise at multiple levels: for example, the simple act of having
to move data from source (generation) to the destination where
computing (analysis) will occur is a challenge due to the
volumes involved. Trade-offs exist between the cost/challenges
of distributing data and I/O saturation or memory bottlenecks.
When coupled with the compute-intensive nature of the problem,
it soon emerges that a fundamental challenge is not only
whether to distribute, but where to distribute, what to
distribute (should the computing move to the data, or the data
move to the compute), and how to distribute (what tools and
infrastructure to use).

The execution of both ESGF and NGS applications on 
distributed infrastructure can have dynamic aspects when 
optimized resource usage is considered. For example, work- 
load decomposition and distribution must be determined dy- 
namically in order to optimally use resources, e.g., compute 
resource selection can be based on data location, processing 
profile and/or network capacity. 

Overview and Outline 

As the ESGF and NGS analytics examples establish, the real
bottleneck is the collective distributed compute-data
management and scheduling problem. Pilot-Jobs have a
demonstrable record of effective distributed resource
utilization and of supporting a broad range of application
types [4], [5]. In this paper, we explore the generalization of
the Pilot-Job concept, via Pilot-Data, to that of
Pilot-Abstractions as a way to efficiently manage distributed
data and its dynamic placement with respect to computation. We
associate our work with Pilot-Jobs so as to take advantage of
their established capability of providing distributed
computing resources. By doing so, we also address a
significant gap that has come to the fore as the typical data
volumes used/required by science applications have increased,
viz., there is insufficient and incomplete support for data
movement and placement in many Pilot-Job implementations [4].

Specifically, we introduce Pilot-Data (PD) as a novel
abstraction for data-intensive applications that provides
late-binding capabilities for data by separating the
allocation of physical storage and application-level
Data-Units. Although there are multiple pre-existing
approaches that dynamically couple data to compute tasks,
there are two distinct features of the Pilot-Data approach:
first, Pilot-Data provides a mechanism that can work in
conjunction with Pilot-Jobs but is not dependent upon them.
Second, Pilot-Data provides a general approach to data-compute
placement, in that it is not constrained to a specific
scheduling algorithm or infrastructure; the only potential
limitation arises as a consequence of the Pilot-Job with which
it may be paired.

The role of a conceptual abstraction is to be able to reason
about a capability without having to worry about
implementation details of that capability. Specifically, the
suggestion that Pilot-Data is a conceptual abstraction for
distributed data is predicated upon the fact that, like any
valid abstraction, it must provide applications with a
unifying programming model and usage modes. As an example of
how this is achieved, Pilot-Data provides a simple and useful
notion of distributed logical location that, from an
application's perspective, is invariant over the lifetime;
thus it supports a decoupling in both time (i.e., allowing
late-binding) and space between the actual physical
infrastructure and the application usage of that
infrastructure. Pilot-Data must thus retain the flexibility to
be used with different CI whilst not being constrained to
specific modes of execution or usage.

A critically important point to note is that our focus is on
addressing the challenges outlined in the context of
production distributed CI and not research prototypes, or
"feature rich" but closed or specific data-cyberinfrastructure
and back-end systems. We believe there are three primary
macroscopic architectures for scalable production distributed
cyberinfrastructure: (i) the data is essentially localized,
e.g., "poured" into a cloud, or, given the volumes of data, the
scale over which localization occurs is relatively small.
Interestingly, given the ratio of the high computational and
data-storage capacity to the number of sites in XSEDE, it too
can be classified in this category; (ii) the data is decomposed
and distributed (with multi-tier redundancy and caching) to an
appropriate number of computing/analytical engines as
available, e.g., as employed by the European Grid Initiative
(EGI) [6]/Open Science Grid (OSG) [7] for particle physics and
the discovery of the Higgs; and (iii) a hybrid of the above two
paradigms, wherein data is decomposed and committed to several
infrastructures, which in turn could be a combination of
either of the first two paradigms.

We will show how Pilot-Abstractions can be used for all three
architectural paradigms. However, it is not enough to suggest
that an abstraction is compatible with a macroscopic
architectural paradigm, for there can be multiple
implementations of a macroscopic architecture, e.g., OSG and
EGI are very different implementations of essentially the same
architecture, with significant differences in the middleware,
tools and services available.

Combining the previous points forms the basis for our claim
that Pilot-Data, when taken in conjunction with Pilot-Jobs,
provides unified Pilot-Abstractions and presents a general and
flexible solution to the collective distributed compute-data
management and scheduling problem. Furthermore, our model and
implementation of Pilot-Abstractions is fundamentally agnostic
to the underlying CI; our implementation interfaces seamlessly
with SAGA [8], which is a well-established route to
interoperability.

In §2 we discuss related work with a view to providing the
reader with a better appreciation for the scope of our work
with respect to other distributed data scenarios. Before
discussing the Pilot-Abstractions, we present a simple but
general model in §3 to understand the primary components and
trade-offs that determine compute-data placement; this model
is independent of any specific infrastructure or approach. We
begin §4 by discussing Pilot-Jobs and the P* Model, a minimal
and complete model for Pilot-Jobs; we discuss how the P* Model
supports a logical extension to Pilot-Data as an abstraction to
support distributed data scenarios. §5 discusses data-compute
placement in the context of Pilot-Jobs and Pilot-Data and
introduces a Pilot-API as a means of expressing the coupling of
these entities. In order to establish and evaluate Pilot-Data
as an abstraction for distributed data, we design and conduct a
series of experiments in §6. We conclude with a discussion of
the main lessons learned as well as relevant and future issues.



2. RELATED WORK 

The landscape of solutions that have been devised over the
years to address the challenges and requirements of
distributed data is vast. Any discussion of relevant related
work has to be focused by design and limited by necessity.
Thus in this section we provide a brief discussion of relevant
efforts that support compute-data co-placement and/or data
management in production DCI, such as XSEDE, OSG and EGI.

2.1 Distributed Data Management 

A myriad of storage solutions exist, ranging from local and
parallel filesystems, e.g., Lustre [9], to distributed data
stores, such as Amazon S3 [10]. Commonly, storage systems can
be accessed via different protocols, e.g., the virtual
filesystem layer in Linux or a transfer protocol, such as
GridFTP [11]. Often, these storage systems provide an unknown
quality of service, i.e., an application is only aware of
high-level characteristics of a system and usually does not
know a priori what throughput and latencies it can expect.

Several distributed data management systems have been built on
top of these low-level storage systems to facilitate the
management of geographically dispersed storage resources.
Systems like SRM [12] and iRODS [13] combine storage services
with services for metadata, replica and transfer management,
and scheduling. Also, different services covering singular
aspects exist, such as replica management (e.g., the Replica
Location Service (RLS) [14] or the LCG File Catalogue (LFC)
[15]). Globus Online [16] is a hosted transfer service that is
based on GridFTP. In the following we focus on some
representative examples.

Storage Resource Manager (SRM) is a type of storage service
that provides dynamic file management capabilities for shared
storage resources via a standardized interface. SRM is
primarily designed as an access layer with a logical namespace
on top of different site-specific storage services. SRM aims to
hide the complexity of different low-level storage services,
but does not allow applications to control and reason about
data locality; this is left to other components, e.g., a
combination of information system and replica manager (BDII &
LFC in the case of EGI).

iRODS is a comprehensive distributed data management solution
designed to operate across geographically distributed,
federated storage resources. Central to iRODS are the so-called
micro-services, i.e., the user-defined control logic.
Micro-services are automatically triggered and handle
pre-defined tasks, e.g., the replication of a data set to a set
of resources.

The Global Federated Filesystem (GFFS) [18] is part of the
Genesis II middleware and provides a global namespace on top of
a heterogeneous set of storage resources. The system handles
file movement and replication transparently.



2.2 Pilot-like Approaches for Distributed Data 

Pilot-Jobs have been successful abstractions in distributed
computing as evidenced by a plethora of PJ frameworks:
Condor-G/Glide-in [19], DIANE [20], PanDA [21], ToPoS,
Nimrod/G [23] and MyCluster [25], to name a few. However, only
a few of them provide integrated compute/data capabilities.
For example, DIRAC [26] is a Pilot-based workload management
system for compute and data, which is used in the LHC
Computing Grid. Its integrated data management system
maintains replica locations and manages data transfers. For
this purpose, the system interfaces to SRM storage resources.
Another example is Swift [27], which provides a data management
component called GDM. DIANE provides in-band data transfer
functionality over its CORBA channel.

2.3 Data-Compute Co-Scheduling 

There are several projects that aim towards integrated data
and compute scheduling. For example, the Stork [28] data-aware
batch scheduler provides advanced data and compute placement
for Condor and DAGMan. Stork supports multiple transfer
protocols, such as SRM, (Grid)FTP, HTTP and SRB. Romosan et al.
[29] present another data-compute co-scheduling approach on top
of Condor and SRM. Both approaches build on top of existing job
scheduling, data-transfer and storage solutions.

Another well-known example is the MapReduce framework Hadoop
[30]. Hadoop provides a highly integrated environment for
data-intensive applications. The new Hadoop resource
scheduler, YARN [31], tightly integrates the Hadoop filesystem
with the compute resources of the cluster. Whenever possible,
compute tasks are placed on the same node as the data. Similar
integrated capabilities are currently missing from scientific
cyberinfrastructures. While Hadoop has gained a lot of traction
in local cluster environments, it lacks the ability to
efficiently handle distributed data.

Various abstractions for optimizing access and management of
distributed data have been proposed: Filecule [32] is an
abstraction that groups a set of files that are often used
together, allowing efficient management of data using bulk
operations. This includes the scheduling of data transfers
and/or replications. Similar file grouping mechanisms have
been proposed by Amer et al. [33], Ganger et al. [34] and
BitDew [35]. Another example is DataCutter [36], a framework
that enables exploration and querying of large datasets while
minimizing the necessary data movements.

Finally, research on when to (potentially dynamically)
distribute and replicate data has been conducted: for example,
Foster [37] and Bell [38] investigate different data
replication management systems and dynamic replication
algorithms in the context of scientific data grids. A
limitation of the previous approaches is that the systems and
algorithms are usually constrained to system-level
replication, making it difficult for the user to control
replication at the application level and employ dynamic
replication strategies.

A limitation of current data-cyberinfrastructures is the lack
of integration between data and compute capabilities. Many
systems focus solely on data or compute aspects, leaving it to
the application to integrate compute and data. Typically, this
leads to the necessity to manually move data in and out of the
systems. Localities between data and compute infrastructure
are often unknown and difficult to deduce, thereby making it
difficult for the application to optimize data and compute
placements. System-level schedulers are usually not aware of
the application characteristics (e.g., specific compute/data
dependencies), which usually leads to non-optimal placement
decisions. In many cases there is a need to manually allocate
and manage resources at a very low level in order to achieve
acceptable performance.

3. A SIMPLE MODEL FOR COMPUTE-DATA 
PLACEMENT 

A question that arises in the design of systems and that 
distributed data-intensive applications have to address, is 
whether to assign and move computational tasks to where 
data resides, or to move data to where computational tasks 
are executed. Another fundamental question is when to 
commit to a given approach. Additionally, if replication is 
an option, applications and systems have to determine what 
the degree of replication of data should be, and possibly 
where to replicate. Here we present a simple model to help 
reason along the above lines. We posit that the fundamental 
parameters in an infrastructure agnostic model are: 

• $T_Q$, defined as the queue waiting time at a given resource.

• $T_C$, defined as the compute time.

• $T_X$, defined as the time to transfer a number of bytes, say
B bytes, over the network to a given resource from a defined
source; this measures the time-in-flight for data.

• $T_S$, the staging time, defined as the data time-in-flight
plus the time to get data into a system (i.e., to register
it). Staging time can refer to either upload or download; when
referring to upload, $T_S = T_U$. Thus
$T_S = T_X + T_{register}$, where $T_{register}$ is the time to
register data at the end points (e.g., a thousand files into a
catalogue). For most scenarios in this paper, $T_{register}$ is
negligible compared to $T_X$ to first order, thus
$T_X \approx T_S$.

• $T_R$, defined as the time to replicate data, where R is the
number of sites that data is replicated over.

• $T_D$, the time at which data will be accessible across all
distributed resources. When replication is involved, it is
defined as the sum of $T_R(R)$ and $T_S$.
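
As a minimal sketch of how these quantities relate, the
following Python fragment encodes the two sums stated above,
staging time as transfer time plus registration time, and data
availability time as replication time plus staging time. The
class name, field names and numbers are illustrative
assumptions; they are not part of the model or of any
implementation.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TransferEstimate:
    """Illustrative per-site time estimates, all in seconds."""
    t_x: float               # time-in-flight for B bytes
    t_register: float = 0.0  # time to register data at the end point
    t_r: float = 0.0         # replication time T_R(R); 0 without replication

    @property
    def t_s(self) -> float:
        # Staging time: T_S = T_X + T_register
        return self.t_x + self.t_register

    @property
    def t_d(self) -> float:
        # Data availability time: T_D = T_R(R) + T_S
        return self.t_r + self.t_s


# Made-up numbers: registration is small compared to the transfer,
# so T_S is close to T_X, as assumed in the text.
est = TransferEstimate(t_x=120.0, t_register=2.0, t_r=300.0)
print(est.t_s, est.t_d)
```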

The above expressions provide a basis to reason whether to
distribute/replicate data before determining where to compute,
or to distribute/replicate data after having determined which
compute resources to use. To a first approximation, which of
the two approaches should be employed is given by the relative
values of $T_D$ and the typical value of $T_Q$, which in turn
depends on the data volume under consideration. The desired
mode also depends strongly on the capability of the tools and
middleware in use; however, it is important to reiterate that
our model is agnostic of specific implementation details. As an
example, although our focus is on Pilot-Abstractions, the model
per se does not presume any dependence on Pilots.

We begin by considering the simple scenario where data
replication is not an option (Case A). Thus, the only non-zero
contribution to $T_D$ is provided by $T_X$. In this case the
decision about which entity (D or C) to place/schedule first
and which to move, compute to data (C2D) or data to compute
(D2C), is determined via a simple trade-off between $T_Q$ and
$T_X$. Assuming only one site is to be chosen, the relative
values of these two terms are determined for all possible
sites; the site with the greatest similarity between the mean
queue waiting time and $T_X$ is chosen. If $T_X$ is larger than
the mean, then the compute is assigned to a site first, and
subsequently data is placed. Although significantly more
complex, the above can be generalized to schedule/place
multiple tasks over multiple sites. Having established that, we
move our attention to the case when replication is an option,
i.e., Case B.
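
The Case A rule can be sketched in a few lines of Python. This
is only one reading of the rule, under stated assumptions: each
candidate site comes with a list of observed queue waiting
times and an estimated transfer time, "similarity" is taken to
be the absolute difference between the mean wait and the
transfer time, and the "data-first" branch (not spelled out in
the text) is assumed to be the complement of the stated
"compute-first" branch. Site names and numbers are hypothetical.

```python
from statistics import mean


def choose_site(sites):
    """Pick a site and a placement order for Case A (no replication).

    sites maps a site name to (observed_queue_waits, t_x), in seconds.
    """
    def gap(item):
        _, (waits, t_x) = item
        # Smaller gap = mean queue wait and transfer time are more similar.
        return abs(mean(waits) - t_x)

    name, (waits, t_x) = min(sites.items(), key=gap)
    # Stated rule: if T_X exceeds the mean queue wait, assign compute first.
    # The else-branch is an assumption made for this sketch.
    order = "compute-first" if t_x > mean(waits) else "data-first"
    return name, order


sites = {
    "site_a": ([600, 900, 1200], 300),  # mean wait 900 s, T_X 300 s
    "site_b": ([100, 200, 300], 250),   # mean wait 200 s, T_X 250 s
}
print(choose_site(sites))
```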

For Case B, an a priori distribution (and/or replication) of
data can be assumed to exist; in response to this
distribution, compute resources are chosen. Consequently, from
the list of resources co-located with a data replica, the
resource with the lowest queue waiting time presents an
optimal choice. However, it is important to appreciate that
there is an overhead in ensuring that data is replicated in a
distributed fashion (which we refer to as the upload phase).
Again, moving compute to data (or close to the data) vs. data
to compute is a degree of freedom.
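
A minimal sketch of the Case B selection, assuming that the
replica locations and an estimate of the queue waiting time per
resource are already known. The function name, resource names
and numbers are hypothetical.

```python
def pick_resource(replica_sites, queue_wait):
    """Return the replica-holding resource with the lowest queue wait.

    replica_sites: iterable of resources co-located with a data replica.
    queue_wait:    dict mapping resource name to estimated wait (seconds).
    Returns None if no replica site has a known queue wait.
    """
    candidates = [site for site in replica_sites if site in queue_wait]
    if not candidates:
        return None
    return min(candidates, key=lambda site: queue_wait[site])


queue_wait = {"hpc-1": 3600, "hpc-2": 240, "hpc-3": 30}
# hpc-3 has the shortest queue but holds no replica, so it is not considered.
print(pick_resource(["hpc-1", "hpc-2"], queue_wait))
```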

In practice hybrid modes can be employed. As an exam- 
ple, distributed data replication can initially be set to be 
partial, viz., only over a sub-set of possible distributed sites. 
In other words, replication might commence over a sub-set 
of suitably chosen nodes, followed by a sequential increase 
in the replication (factor) if compute resources close to the 
replica do not have sufficient compute capacity. 
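
The hybrid mode can be illustrated with a small sketch that
starts from a partial replica set and adds one site at a time
until the compute capacity close to the replicas suffices. How
sites are ranked and how capacity is measured are assumptions
of this sketch (here: a pre-ranked list and a core count per
site); all names and numbers are hypothetical.

```python
def grow_replication(ranked_sites, capacity, required_cores, initial=1):
    """Sequentially increase the replication factor until the sites
    holding a replica offer at least required_cores in total."""
    replicas = list(ranked_sites[:initial])
    remaining = list(ranked_sites[initial:])
    while remaining and sum(capacity[s] for s in replicas) < required_cores:
        # Replicate to the next-best site that does not hold a copy yet.
        replicas.append(remaining.pop(0))
    return replicas


capacity = {"a": 16, "b": 32, "c": 64, "d": 128}
print(grow_replication(["a", "b", "c", "d"], capacity, required_cores=100))
```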

4. PILOT-ABSTRACTIONS: A UNIFIED ABSTRACTION
FOR COMPUTE AND DATA

The seamless uptake of distributed infrastructures by
scientific applications has been limited by the lack of
pervasive and simple-to-use abstractions at multiple levels:
at the development, deployment and execution stages. A survey
of actual usage suggested that Pilot-Jobs were arguably one of
the most widely used distributed computing abstractions [4],
as measured by the number and types of applications that use
them, as well as the number of production distributed CI that
support them. Although Pilot-Jobs have been used by many
high-throughput applications, there does not exist a
well-defined, unifying conceptual framework for Pilot-Jobs
which can be used to define, compare and contrast different
implementations. Our survey led us to understand that
different Pilot-Jobs had different semantics and capabilities,
and made widely differing, if not inconsistent, assumptions
about applications/users. This presented a barrier to the
usability and extensibility of Pilot-Jobs.

To address this barrier, we have recently developed [4] the
Pstar (P*) model for Pilot-Jobs, which provides a minimal but
complete model of Pilot-Jobs (the first of its kind), with the
functional objective of providing a single conceptual
framework that can be used to understand and reason about
different Pilot-Job implementations. Furthermore, we have
implemented BigJob (BJ) [5], a SAGA-based Pilot-Job, as a
reference implementation of the P* model. By linking the
theoretical underpinnings and implementation to its usage, we
have established Pilot-Jobs as a richer and more powerful
runtime environment than was hitherto considered possible or
realized. For example, until now Pilot-Jobs have essentially
been "passive" elements: they were not programmable, nor a
means of expressing specific application requirements. The P*
model promotes Pilot-Jobs to an "active" element, capable of
supporting application-specific scheduling and placement
requirements, as well as coordination and coupling
requirements. This in turn has led to a break from
traditionally constrained and confined usage modes for
Pilot-Jobs, as evidenced by a step-change that has arisen in
the nature and scale of problems using Pilot-Jobs [39].

Figure 1: BigJob Pilot-Abstractions and Supported Resource
Types: BigJob provides a unified abstraction to a
heterogeneous set of distributed compute and data resources.
Resources are either accessed via SAGA [8], [41] or via a
custom adaptor.

The P* model defines Pilot-Computes (PC) as the fundamental
entity that is submitted to the resource. The application
workload is described using so-called Compute-Units (CUs) and
submitted via the Pilot-Compute. The notion of Pilot-Data (PD)
was conceived using the power of symmetry, i.e., the notion of
Pilot-Data is as fundamental to dynamic data placement and
scheduling as Pilot-Jobs are to computational tasks. As a
measure of validity, the P* model was amenable and easily
extensible to Pilot-Data. The consistent and symmetrical
treatment of data and compute in the model led to the
generalization of the model as the P* Model of
Pilot-Abstractions. PD provides late-binding capabilities for
data by separating the allocation of physical storage and
application-level Data-Units. Similar to a Pilot-Compute, a
Pilot-Data provides the ability to create Pilots and to insert
and retrieve Data-Units. The Pilot-API [40] provides an
abstract interface to both Pilot-Compute and Pilot-Data
implementations that adhere to the P* model.
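
The separation between storage allocation and Data-Units can
be illustrated with a small sketch. The classes and the bind
function below are hypothetical and are not the Pilot-API or
BigJob code; they only show a Data-Unit being described first
and bound to an allocated Pilot-Data later. The URLs and sizes
are made up.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class PilotData:
    """A placeholder for allocated storage on some backend."""
    service_url: str
    size_mb: int
    used_mb: int = 0


@dataclass
class DataUnit:
    """A logical group of data, described independently of storage."""
    name: str
    size_mb: int
    pilot: Optional[PilotData] = None  # unbound until placement

    @property
    def location(self) -> Optional[str]:
        if self.pilot is None:
            return None
        return f"{self.pilot.service_url}/{self.name}"


def bind(du: DataUnit, pilots) -> PilotData:
    """Late binding: attach the Data-Unit to the first Pilot-Data
    that has enough free space."""
    for pilot in pilots:
        if pilot.size_mb - pilot.used_mb >= du.size_mb:
            pilot.used_mb += du.size_mb
            du.pilot = pilot
            return pilot
    raise RuntimeError("no Pilot-Data with enough free space")


du = DataUnit("reads", size_mb=500)
location_before = du.location  # None: no physical location yet
pilots = [PilotData("ssh://host-a.example.org/scratch", 100),
          PilotData("ssh://host-b.example.org/scratch", 1000)]
bind(du, pilots)
print(location_before, du.location)
```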

4.1 BigJob: A Pilot-Compute and Data Implementation

A main limitation of Pilot-Job frameworks is the inability to
manage distributed data. From a practical point of view, data
management and movement for most Pilot-Jobs is at best obscure
if not outright ad hoc. Most Pilot-Jobs rely on
application-level data management, i.e., data needs to be
pre-staged or each CU is responsible for pulling in the data.
In the following, we propose Pilot-Data as a consistent
interface for data management in conjunction with Pilot-Jobs.
Similarly to Pilot-Jobs, Pilot-Data is an application-level
construct that allows the logical decoupling of physical
storage/data locations from the production and consumption of
data. Among other things, Pilot-Data facilitates late binding
of data and physical resources. The Pilot-Data abstraction
aims to support the distributed coupling of different
application components, i.e., the Compute-Units and
Data-Units, e.g., the parts of a distributed workflow or the
data flow between a compute and an analysis job.

Pilot-Data is an extension of BigJob (BJ) [5], [42], which
provides a unified runtime environment for Pilot-Computes and
Pilot-Data on heterogeneous infrastructures (see Figure 1),
along with a higher-level, unifying interface to heterogeneous
and/or distributed data and compute resources. Just as
Pilot-Jobs eliminate the need to interact directly with
different kinds of resources, e.g., batch-style HPC/HTC or
cloud resources, Pilot-Data removes the necessity to manually
interoperate with different data sources and stores, e.g.,
file storage, repositories, databases etc., by providing a
unified data management and access layer, which can be used to
allocate and access a heterogeneous set of resources.

Figure 2: BigJob High-Level Architecture: The Pilot-Manager is
the central coordinator of the framework, which orchestrates a
set of Pilots. Each Pilot is represented by a decentralized
component referred to as the Pilot-Agent, which manages the
set of resources assigned to it.

Figure 2 shows the high-level architecture of BigJob. The
Pilot-Manager is the central entity, which manages the actual
Pilots via the Pilot-Agent. BigJob supports different resource
types via an adaptor mechanism (see adaptor pattern). The
adaptor encapsulates the different infrastructure-specific
semantics of the backend system, e.g., in the case of
Pilot-Data, different storage types (e.g., file vs. object
storage) and access and transfer protocols. Figure 1
illustrates the available adaptors. For Pilot-Computes, BJ
supports various types of HPC/HTC resources via SAGA-Python
[41] (e.g., Globus, Torque or Condor resources). Further,
adaptors for cloud resources (Amazon EC2 and Google Compute
Engine) exist. Similarly, Pilot-Data can marshal different
types of storage resources and access protocols, e.g., a local
filesystem accessed via SSH or a transport service, such as
Globus Online [16], which in turn utilizes GridFTP [11] as its
transport protocol. Also, Pilot-Data can support more advanced
distributed data management systems, such as iRODS. The
respective resource adaptor is selected based on the resource
URL of the backend system defined by the application in the
Pilot-Description.
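
Selecting an adaptor from the resource URL can be sketched as
a lookup on the URL scheme. The adaptor classes, the registry
and the scheme names below are assumptions made for
illustration and do not reflect BigJob's actual class names or
URL conventions; only `urllib.parse.urlparse` is a real
library call.

```python
from urllib.parse import urlparse


class Adaptor:
    """Base class: wraps the backend-specific access logic."""
    def __init__(self, service_url):
        self.service_url = service_url


class SSHFileAdaptor(Adaptor):
    """Hypothetical adaptor for a filesystem reached via SSH."""


class GlobusOnlineAdaptor(Adaptor):
    """Hypothetical adaptor for a hosted transfer service."""


class IRODSAdaptor(Adaptor):
    """Hypothetical adaptor for an iRODS collection."""


# Registry mapping URL schemes (assumed names) to adaptor classes.
ADAPTORS = {
    "ssh": SSHFileAdaptor,
    "go": GlobusOnlineAdaptor,
    "irods": IRODSAdaptor,
}


def select_adaptor(service_url):
    """Instantiate the adaptor registered for the URL's scheme."""
    scheme = urlparse(service_url).scheme
    adaptor_cls = ADAPTORS.get(scheme)
    if adaptor_cls is None:
        raise ValueError(f"no adaptor registered for scheme {scheme!r}")
    return adaptor_cls(service_url)


adaptor = select_adaptor("ssh://host.example.org/scratch/pilot-data")
print(type(adaptor).__name__)
```

Keeping the mapping in a registry means that supporting a new
backend only requires a new adaptor class and one registry
entry; the calling code stays unchanged.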

Figure [3] shows the typical interactions between the com- 
ponents of the BigJob/Pilot-Data framework after the sub- 
mission of the application workload (i. e. the CUs and DUs). 
The core of the framework is the Pilot-Manager. The Pilot- 
Manager is able to manage multiple Pilot-Agents. The ap- 
plication workload is submitted to the Pilot-Manager via 
the Compute-Data Service interface of the Pilot-API (see
the Pilot-API section). After submission, DUs and CUs are put into
an in-memory queue, which is continuously processed by the
scheduler component. This asynchronous interface ensures 
that the application can continue without needing to wait 
for BigJob to finish the placement of a CU or DU. 
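
The asynchronous submission path can be sketched as follows, assuming a single scheduler thread that drains an in-memory queue. The names `submit`, `scheduler_loop` and the fixed target `pilot-0` are hypothetical; the real scheduler makes an actual placement decision at that point.

```python
import queue
import threading

work_queue = queue.Queue()   # in-memory queue of submitted units
placed = []                  # records (unit, pilot) decisions


def scheduler_loop():
    """Continuously take units off the queue and 'place' them."""
    while True:
        unit = work_queue.get()
        if unit is None:                  # sentinel: stop the scheduler
            work_queue.task_done()
            break
        placed.append((unit, "pilot-0"))  # placeholder placement decision
        work_queue.task_done()


def submit(unit):
    """Enqueue a unit and return immediately, without waiting for placement."""
    work_queue.put(unit)


scheduler = threading.Thread(target=scheduler_loop, daemon=True)
scheduler.start()

for name in ("du-1", "cu-1", "cu-2"):
    submit(name)                          # returns at once

work_queue.join()      # only this demo waits; an application need not
work_queue.put(None)
scheduler.join()
print(placed)
```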

Figure 3: BigJob Application Workload Management:
The figure illustrates the typical steps involved
for placing and managing the application
workload, i. e. the DUs and CUs.

Another core component is the distributed coordination
service that is used for communication between the
Pilot-Manager and Pilot-Agent. Currently, BigJob relies on
Redis for this purpose. Other kinds of coordination services
are supported via a plugin scheme. The main task of Redis
is to facilitate the communication and coordination between
Pilot-Manager and Agents: (i) The Pilot-Agent collects
various information about the local resource, which is pushed
to the Redis server and used by the Pilot-Manager to
conduct e. g. placement decisions; (ii) CUs are stored in several
queues. Each Pilot-Agent generally pulls from two queues:
its agent-specific queue and a global queue. Since the Redis
server is globally available, it also serves as a central repository
that enables the seamless usage of BigJob from distributed
locations. That means that an application can connect back to
its entities (i. e. Pilots, DUs, and CUs) via a unique URL,
query them for state information and update them.
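
The two-queue pull of a Pilot-Agent can be sketched with in-memory stand-ins for the queues held by the coordination service. The order in which the two queues are checked (agent-specific first, then global) is an assumption of this sketch, as are all names.

```python
from collections import deque

# In-memory stand-ins for the queues kept in the coordination service.
global_queue = deque()
agent_queues = {"pilot-a": deque(), "pilot-b": deque()}


def pull_next_cu(agent_id):
    """Return the next CU for this agent, or None if both queues are empty.

    The agent-specific queue is checked first (an assumed order).
    """
    own = agent_queues[agent_id]
    if own:
        return own.popleft()
    if global_queue:
        return global_queue.popleft()
    return None


agent_queues["pilot-a"].append("cu-with-affinity")
global_queue.extend(["cu-1", "cu-2"])

print(pull_next_cu("pilot-a"))   # taken from pilot-a's own queue
print(pull_next_cu("pilot-b"))   # pilot-b has no own work, uses the global queue
print(pull_next_cu("pilot-a"))
print(pull_next_cu("pilot-b"))   # nothing left
```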



4.2 Pilot-API: Managing Distributed, Data-
Intensive Resources and Computations



The Pilot-API [40] provides a well-defined control and
programming interface for pilots, which is built upon the
consistent and well-defined semantics associated with pilots [2].
It is an interoperable and extensible API that exposes the
core functionalities of a Pilot framework via a unified
interface and can be used across multiple distinct production
cyberinfrastructures. It offers a unified API for managing
both compute and data pilots as well as application
workloads. BigJob provides a full implementation of the
Pilot-API and enables the management of resources, CUs and
DUs as well as the relationships between them. Specifically,
the Pilot-API promotes affinities as a first-class
characteristic for describing such relationships between the
compute and data elements defined by the P* model and for
supporting dynamic decision making.

The API separates the concerns of (i) management of
pilots and resources and (ii) application workload management
(see Figure [4]). The Pilot-Compute Service is responsible for
controlling the lifecycle of Pilot-Computes; the Pilot-Data
Service manages Pilot-Data. Using Pilot-Abstractions, the
different types of distributed compute resources, storage
infrastructures, transport protocols etc. can be marshaled,
providing a seamless, unifying environment to the
application for managing resources. Applications can simply
acquire resources in the form of Pilots and then assign their
workload to these resources.

Figure 4: Control Flow Pilot-API: The API exposes
two primary functionalities: The PilotComputeService
and PilotDataService are used for the management
of Pilot-Computes and Pilot-Data. The application
workload is submitted via the Compute-Data
Service.

The application or tool that uses the Pilot-API utilizes
Data-Units (DUs) and Compute-Units (CUs) as abstractions
for expressing the application's workload. A CU represents a
primary self-contained piece of work, while a DU represents
a self-contained, related set of data. A DU is a container
for a logical group of "affine" data, e. g. data that is often
accessed together by various Compute-Units. A Compute-Unit
is a computational task that potentially operates on a
set of input data, specified in the form of one or more
dependent Data-Units.

The Compute-Data Service is the central abstraction for 
submitting and managing this application workload, i. e. the 
Compute-Units and Data-Units. Both CUs and DUs are 
submitted to the runtime system via the Compute-Data Ser- 
vice interface. The runtime system is responsible for placing 
CUs and DUs on a Pilot (placed on a respective resource).
For this purpose, the runtime system relies e. g. on the
localities of the DUs, to facilitate scheduling and other types
of decision making (see section [5]).

4.3 Experiences on Different Distributed In- 
frastructures 

The objective of the following experiment is to demonstrate
the ability of Pilot-Data to marshal different storage
backend infrastructures. Using Pilot-Data, we characterize
the behavior of PD on different cyberinfrastructures
(e. g. XSEDE and OSG) before we investigate advanced
compute/data placement strategies in section [6].

For our experiments, we utilize GW68 - a gateway node
located at Indiana University and part of the XSEDE
infrastructure - as our submission machine to both XSEDE and
OSG resources. Figure [5] illustrates Tu for different
backends, i. e. the time necessary to populate a Pilot-Data store
on different infrastructures: (i) a directory of the work
filesystem on Lonestar, (ii) an S3 bucket and (iii) iRODS
on OSG. For scenario (i), we utilize different data movement
systems to access the data: SSH and Globus Online/GridFTP.
For all scenarios, we only consider the Tu part of
Td (which is equivalent to Tg), i. e. for the iRODS scenario
Tr is not depicted.

Figure 5: Pilot-Data on Different Infrastructures:
Time to initialize a Pilot-Data with a dataset of
given size. For iRODS, we measure only the upload
time and not the time required to replicate the data.
SSH/iRODS perform well for small data sizes; Globus
Online for larger data volumes. S3 performance is
limited by Internet connectivity.

Figure [5] illustrates the results of this experiment. The
performance of Pilot-Data primarily depends on the
infrastructure used. In general, the overhead of Pilot-Data itself
is very low. Tu is dominated by Tx, i. e. the time necessary
to transfer files to the Pilot-Data location. Experiments
with smaller data sizes have shown that Tregister is
negligible. Thus, the runtime is directly influenced by the available
bandwidth and the characteristics of the respective transfer
protocol. While SSH performs best for smaller data sizes,
Globus Online performs particularly well for larger sizes,
in particular by relying on GridFTP as transfer protocol. Tu
for iRODS behaves comparably to Tu for SSH. Tu for S3
increases linearly - an indicator that it is bound by the
available Internet bandwidth.
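
The bandwidth-bound behaviour can be made concrete with a small estimate. The sketch assumes that Tu can be read as Tx plus Tregister and that Tx grows linearly with the data size; the function name and the bandwidth value are illustrative and are not measurements from this experiment.

```python
def estimate_upload_time(size_gb, bandwidth_mb_per_s, t_register_s=0.0):
    """Estimate Tu = Tx + Tregister with Tx = size / bandwidth (linear model)."""
    if bandwidth_mb_per_s <= 0:
        raise ValueError("bandwidth must be positive")
    t_x = size_gb * 1024.0 / bandwidth_mb_per_s   # transfer time in seconds
    return t_x + t_register_s


# Illustrative numbers only: doubling the data size doubles the estimate.
for size_gb in (1, 2, 4):
    print(size_gb, estimate_upload_time(size_gb, bandwidth_mb_per_s=64.0))
```

A protocol with a higher fixed setup cost but a higher sustained bandwidth would lose for small sizes and win for large ones under this model, which matches the qualitative SSH versus Globus Online observation.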

If system-level support for replication is provided, e. g. by
a distributed data management middleware such as iRODS,
Pilot-Data can utilize this capability as a dynamic caching
mechanism (to be contrasted with the usage of iRODS for
data storage and management). While we previously only
considered Tu, we continue our investigation with an
evaluation of Tr, i. e. the replication overhead. In the following
experiment, we investigate Tr for different infrastructures and
configurations: (i) iRODS/OSG with group-based replication,
(ii) iRODS/OSG with sequential replication in which
one replica is created after the other, and (iii) EGI with
sequential replication using SRM/LFC. For this purpose, the
data is replicated to resource sets of different types and sizes:
on EGI to 11 SRM resources of the Dutch e-Science Grid,
in the OSG scenario to 6 iRODS resources and in the OSG
(osgGridFTPGroup) scenario to 9 iRODS resources that are
members of this group.

Figure [6] illustrates the results. On OSG, Tr for the
group-based replication is significantly better than for the
sequential replication. In both cases, the frequency of failures
was very high. While the osgGridFTPGroup group consisted
of 9 nodes, the average number of resources that actually
received a replica was ~7.5. In general, the overall
performance is determined by the available bandwidth between
the central iRODS server (located at Fermilab near Chicago)
and the individual sites. For the 4 GB case in scenario (ii), the
individual Tx are depicted in the inset of Figure [6]. While
the performance of the EGI scenario is very good, it must
be noted that the resources are not as geographically
dispersed as in the OSG case, where resources are distributed
across the east, south and center of the US. Nevertheless,
the sequential replication is well suited for creating a small
number of replicas. The different sites have very different
performance characteristics. Thus, the ability for
applications to optimize data/compute placement with respect to
their computational and data requirements is critical. If the
compute resources of the three "closest" sites are sufficient
and available, it is e. g. not necessary to replicate all data
across OSG.



Figure 6: Replication on EGI and OSG: Tr on dif- 
ferent infrastructures and resource sets: on EGI to 
11 SRM resources of the Dutch e-Science Grid, in 
the OSG scenario to 6 iRODS resources and in the 
OSG (osgGridFTPGroup) scenario to 9 iRODS re- 
sources that are members of this group. The inset 
shows the distribution of Tr with respect to the dif- 
ferent hosts for the 4 GB & OSG/iRODS scenario. 

5. DATA PLACEMENT AND COMPUTE- 
DATA CO-SCHEDULING 

The aim of Pilot-Data is to enable the efficient management
of data in conjunction with compute. Different
investigations (e. g. [44], [37]) have shown that considering
data and compute entities equally while making placement
decisions leads to performance gains. An important
consequence of treating data and computation as equal first-class
entities is that either data can be provisioned where
computation is scheduled to take place (D2C, as is done
traditionally), or compute can be provisioned where data
resides (C2D). This equal assignment of Compute-Units leads
to a richer set of possible correlations between the involved
DUs and CUs; correlations can be spatial and/or temporal.
These correlations arise either as a consequence of constraints
of localization (e. g., data is fixed, compute must move, or
vice versa), or as temporal ordering imposed on the different
data and computational units.

We propose affinities as an abstraction for managing
co-locations of Pilot-Data, Pilot-Computes and DUs/CUs in
section [5.1]. We continue with a description of how PJ
frameworks, such as BigJob, can use affinities for making
data/compute placement decisions in section [5.2].



5.1 Affinities: Managing Relationships be- 
tween Data and Compute 

Managing data locality in a distributed environment is a
great challenge due to the heterogeneous landscape of
data/compute infrastructures coupled with the large amount of
dynamism associated with distributed environments.
Traditionally, Pilot-Abstractions have been used to provide
a uniform interface to these heterogeneous infrastructures.
We propose affinities as an extension to Pilot-Abstractions.
Affinities are an essential tool for modeling different resource
topologies and application characteristics, and allow for
effective reasoning about resource topologies and data/compute
dependencies. We will show in section [6] that the usage of
affinities in conjunction with Pilot-Abstractions for distributed
data yields better performance and scalability compared
to simplistic, if not ad-hoc, ways of data-compute
coupling while still maintaining a high level of simplicity.

Figure 7: Affinities between Distributed Resources:
Pilot-Data assigns each resource an affinity based
on a simple hierarchical model. The smaller the
distance between two resources, the larger the affinity.

The concept of affinity is used to describe the relationship
between the different entities of a PJ-Framework (defined in
the P* model), e. g. the relationship between CUs, DUs,
Pilots and resources. This model enables the framework
to reason about different trade-offs, e. g. to make decisions
on whether to move data versus move the compute (see
section 5.2) based on resource and bandwidth availabilities.

The model supports different kinds of affinities: (i) Pi- 
lot/resource affinities describe the relationship between 
multiple pilots and the resources they are running on; (ii) 
compute/data affinities describe the relationships between 
CUs and DUs, which can exist in various forms, e.g. D-D, 
C-D, or C-C. 

Resource/Pilots Affinities: Resource affinities describe
the relationship between a set of compute and/or storage
resources. We use a simple model for describing resource
affinities: Data centers and machines are organized in a logical
topology tree (similar to the tree spanned by a DNS
topology). The greater the distance between two resources, the
smaller their affinity. Figure [7] shows e. g. how a distributed
system consisting of different types of cloud and grid
resources can be modeled. Using such a resource topology,
the runtime system can deduce the connectivity between
two resources, to estimate e. g. the costs induced by a
potential data transfer. While this model is currently very
coarse grained, it can be enhanced by assigning weights to
each edge to reflect dynamic changes in factors that
contribute to connectivity.

The affinity of a Pilot is determined based on the resource
it is located on, i. e. the proximity of two Pilots is deduced
from the distance of their resources in the resource topology
tree. The mapping between resource and Pilot is done by 
assigning each Pilot a logical location using a user-defined 
affinity label in the Pilot-Description. This logical location 
assignment is utilized by the scheduler to create the resource 
topology tree. 
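
The hierarchical model can be sketched as a distance computation on path-like affinity labels. The label syntax (components separated by "/"), the example labels and the mapping from distance to an affinity score are assumptions made for this illustration; the text only states that a smaller distance in the topology tree means a larger affinity.

```python
def split_label(label):
    """Split a hierarchical affinity label into its path components."""
    return [part for part in label.split("/") if part]


def distance(label_a, label_b):
    """Number of tree edges between two nodes of the topology tree."""
    a, b = split_label(label_a), split_label(label_b)
    common = 0
    for x, y in zip(a, b):
        if x != y:
            break
        common += 1
    return (len(a) - common) + (len(b) - common)


def affinity(label_a, label_b):
    """Map distance to a score in (0, 1]: smaller distance, larger affinity."""
    return 1.0 / (1.0 + distance(label_a, label_b))


# Hypothetical labels: region / data center / machine.
machine_1 = "region-a/datacenter-1/machine-1"
machine_2 = "region-a/datacenter-1/machine-2"
machine_3 = "region-b/datacenter-7/machine-1"

print(distance(machine_1, machine_1))   # same resource
print(distance(machine_1, machine_2))   # same data center
print(distance(machine_1, machine_3))   # different region
```

Edge weights, as suggested for future refinement of the model, could be added by summing per-edge costs instead of counting edges.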

Compute/Data Affinities: As described, applications
utilize DUs as the primary abstraction for grouping data. CUs
can have input and output dependencies on a set of DUs,
i. e. the data of these DUs is required for the computational 
phase of the CU. The output data is automatically writ- 
ten to one or more output DUs. The framework utilizes 
these affinities to place DUs and CUs into a suitable Pilot- 
Compute or Pilot-Data. Further, CUs and DUs can con- 
strain their execution resource to a particular affinity (e. g. 
to a certain location or sub-tree in the logical resource topol- 
ogy). The runtime system then ensures that the data and 
compute affinity requirements of the CU/DU are met. 

Figure 8: DU and CU Interactions and Data Flow:
Each CU can specify a set of input DUs. The
framework will ensure that the DU is transferred to the
CU. After the job has terminated, the specified output
is moved to the output DU.

Figure [8] shows typical data/compute dependencies and
a typical data flow between multiple phases of compute.
Applications can declaratively specify CUs and DUs and
effectively manage the data flow between them using the
Pilot-API. As described, a CU can have input and output
dependencies on a set of DUs. For this purpose, the API
declares two fields in the Compute-Unit Description,
input_data and output_data, that can be populated with a
reference to a DU. The runtime system ensures that these
dependencies are met when the CU is executed, i. e. either
the DUs are moved to a Pilot that is close to the CU or the
CU is executed in a Pilot close to the DU's Pilot. In the best
case, the Pilot-Data of the dependent DUs is co-located on
the same resource as the CU, i. e. the data can be directly
accessed via a logical filesystem link. Otherwise, the data is
moved via a remote transfer. Further, a CU can constrain
its execution location to a resource with a certain affinity.
If specified, the scheduler will only place the CU in a Pilot
with the right affinity.
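
A Compute-Unit Description with data dependencies, and the resulting link-or-transfer decision, can be sketched as follows. Only the field names input_data and output_data come from the text; the dictionary layout, the remaining keys, the `du://` references and the helper functions are hypothetical.

```python
def stage_input(du, cu_resource):
    """Decide how a DU reaches a CU: link if co-located, else transfer."""
    if du["resource"] == cu_resource:
        return ("link", du["url"])
    return ("transfer", du["url"])


# Hypothetical Data-Units; the dictionary layout is assumed for this sketch.
reference_du = {"url": "du://reference-genome", "resource": "lonestar"}
result_du = {"url": "du://alignment-results", "resource": "lonestar"}

# input_data and output_data are the two fields named in the text; the
# remaining keys are illustrative.
compute_unit_description = {
    "executable": "bwa",
    "arguments": ["aln", "ref.fa", "reads.fq"],
    "input_data": [reference_du["url"]],
    "output_data": [result_du["url"]],
}

known_dus = {du["url"]: du for du in (reference_du, result_du)}


def plan_staging(cu_description, cu_resource):
    """Return the staging action for every input DU of a CU."""
    return [stage_input(known_dus[url], cu_resource)
            for url in cu_description["input_data"]]


print(plan_staging(compute_unit_description, "lonestar"))   # co-located
print(plan_staging(compute_unit_description, "osg-site"))   # remote
```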

The input data is made available in the working directory
of the CU. As described, depending on the locality of the
DU/CUs, different costs can be associated with this
operation. The runtime system relies on an affinity-aware
scheduler that ensures that data movements are minimized and
that, if possible, "affine" CUs and DUs are co-located (see
section [5]). Typically, scientific applications also involve
multiple steps of data generation and processing. As depicted
in Figure [8], the Pilot-API can be used to efficiently manage
data flow between a set of dependent CUs. The input
data can be pulled in from a remote resource or a data
management system, such as iRODS. Intermediate data can be
stored locally for further processing in a second stage (where
filtering and/or aggregation can take place) before the data
is moved to a remote resource.

5.2 Pilot-based Scheduling 

The Pilot-Abstraction provides the basis for application- 
level scheduling. Applications have two options: (i) rely 
on the internal scheduler of the PJ framework (i. e. BigJob) 
or (ii) make placement decisions based on the Pilots it has 
acquired, i. e. it can manually place CUs and DUs in spe- 
cific Pilots. The latter gives an application a high degree of 
freedom, allowing it to pursue strategies optimal for itself, 
e. g. by choosing the right number of replicas for a certain
amount of compute that needs to be carried out on the data.

The scheduler of BigJob is a pluggable component of the
runtime system and can be replaced if desired. The default 
implementation referred to as the affinity-aware scheduler 
relies on the resource topology information specified via the 
Pilot-API. This scheduler is hidden behind the Compute- 
Data Service interface. It uses the specified affinities 
to reason about the relationships between DUs, CUs, Pilots 
and resources to optimize data localities and movements. 

The affinity-aware scheduler currently implements a simple
strategy based on earlier research [44] that suggests that
considering both data and compute during placement
decisions leads to better performance. As shown in Figure [3],
BJ relies on two queues for managing CUs. CUs without any
affinity are assigned to the global queue, from where they can
be pulled by multiple Pilot-Agents. If there is affinity to
a certain Pilot because the input data resides in this
Pilot-Data, the CU can be placed in a Pilot-specific queue. For
each CU the following steps are executed:

1. The Pilot-Manager attempts to find a Pilot that best ful- 
fills the requirements of the CU with respect to (i) the 
requested affinity and (ii) the location of the input data. 

2. If a Pilot with the same affinity exists and the Pilot has an
empty slot, the CU is placed in this Pilot's queue.

3. If delayed scheduling is active, wait for n sec and recheck
whether the Pilot has a free slot.

4. If no Pilot is found, the CU is placed in the global queue and
pulled by the first Pilot which has an available slot.

The Pilot-Agent that pulled the CU from a queue is
responsible for ensuring that the input DU is staged to the correct
location, i. e. before the CU is run, the DU is made avail- 
able in the working directory of the CU either via remote 
transfer or a logical link. 
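
The four placement steps can be sketched in a few lines. This is an illustration under stated assumptions, not BigJob's scheduler: the `Pilot` class, the matching rule in `find_pilot`, the `retries` parameter and the fallback to the global queue when a matching Pilot stays full are all choices made for this example.

```python
import time
from collections import deque


class Pilot:
    def __init__(self, name, affinity, slots):
        self.name = name
        self.affinity = affinity
        self.free_slots = slots
        self.queue = deque()          # Pilot-specific queue


global_queue = deque()


def find_pilot(cu, pilots):
    """Step 1: prefer the requested affinity, else the input data's location."""
    wanted = cu.get("affinity") or cu.get("input_location")
    for pilot in pilots:
        if wanted is not None and pilot.affinity == wanted:
            return pilot
    return None


def schedule(cu, pilots, delay_s=0.0, retries=0):
    """Place one CU and return the name of the queue it went to."""
    pilot = find_pilot(cu, pilots)
    attempts = 0
    while pilot is not None:
        if pilot.free_slots > 0:      # step 2: matching Pilot with a free slot
            pilot.free_slots -= 1
            pilot.queue.append(cu)
            return pilot.name
        if attempts >= retries:       # delayed scheduling off or exhausted
            break
        time.sleep(delay_s)           # step 3: wait, then recheck
        attempts += 1
    global_queue.append(cu)           # step 4: any Pilot may pull it
    return "global"


pilots = [Pilot("pilot-lonestar", "lonestar", slots=1),
          Pilot("pilot-osg", "osg", slots=1)]
print(schedule({"id": "cu-1", "input_location": "lonestar"}, pilots))
print(schedule({"id": "cu-2", "input_location": "lonestar"}, pilots))
print(schedule({"id": "cu-3"}, pilots))
```

In the demo, the first CU follows its data, the second finds the matching Pilot full and falls back to the global queue, and the third has no affinity at all.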

In summary, Pilots provide a well-defined abstraction to
resources, supporting effective resource management at the
application level and the PJ framework level without the
necessity to deal with low-level, infrastructure-specific details.
Pilot-Abstractions and affinities enable applications as well as
the PJ framework runtime system to trade off different
placement options based on the information provided by the
affinity assignment, such as resource localities, and dynamic
information, such as resource and bandwidth availabilities.

6. EXPERIMENTS 

In this section we establish the importance of the
Pilot-Data abstraction by examining two different aspects,
viz., (i) the ability to provide uniformity of access and
usage modes for different infrastructures (e. g. XSEDE and
OSG), and (ii) the performance advantage arising from the
ability to select "optimal" usage modes. Having investigated
different strategies of managing compute-data placements in
section [5], we explore various compute/data placement
strategies using a part of a next-generation sequencing scenario -
the read alignment process using BWA - in this section.

BWA Ensemble and Pilot-Abstractions 

For this scenario, we utilize the Pilot-API to manage the 
input/output data and the compute of the BWA genome
sequencing applications. The application requires two kinds of
input data: (i) the reference genome and index files (~8 GB) 
and (ii) the read file(s) (~2 GB) obtained from the sequenc- 
ing machines. The alignment process can be parallelized by 
partitioning the read files and processing them using con- 
current BWA instances. 

Figure [9] shows the scaling behavior of BWA processing
read files with a total volume of 2 GB partitioned across
different numbers of Compute-Units. For this experiment we
utilize the XSEDE machine Lonestar; for each CU two cores
are allocated in the Pilot. The Pilot-Data for the input data
is co-located with the Pilot-Compute on the Lustre
filesystem of Lonestar. The time to completion for this
scenario decreases as the number of CUs increases. During the
experiments, the Pilot queuing time Tq_pilot, i. e. the time
the Pilot waited in the local resource manager's queue, was
nearly constant (regardless of the amount of requested
resources).

Figure 9: BWA using Pilot-Abstractions on
XSEDE/Lonestar: Using concurrent BWA instances,
the runtime improves with the increasing
number of CUs. The maximum speedup is 10 (see
inset); the speedup saturates with >16 CUs as the
overhead induced by BigJob and file management
increases. (x-axis: Number of Compute Units; legend:
Pilot Queue Tq_pilot, CU Staging Ts_cu, CU Compute Tc,
Runtime T)

The CU queue time Tq_cu describes the coordination
overhead for scheduling, queuing and spawning the application
process; the CU staging time Ts_cu describes the average time
necessary to move the input files to the working directory of the
CU. If both data and compute are co-located (as in this
scenario), a logical link to the necessary files is created. The
more CUs, the more files need to be managed; thus, a slight
increase in the time is observable as the number of CUs
increases. The largest component is the necessary compute
time Tc for the BWA application. As shown, the time
decreases with the amount of data that needs to be processed
by each CU.

In summary, the BWA scenario can be effectively
parallelized by distributing the load to multiple compute units.
The maximum achievable speedup is 10 (see inset of Figure [9]);
however, the efficiency drops with > 16 CUs. With the
increasing number of concurrent CUs, the ratio between the
overhead caused by the management of the additional CUs and
the actual runtime of the BWA CU becomes unfavorable,
leading to a reduced speedup with 32 concurrent CUs.

Off-loading of some CUs to distributed/remote resources 
is a viable strategy to minimize this bottleneck. The main 
barrier for utilizing distributed resources is the necessity to 
move the data to this resource. We evaluate when such 
strategic distribution may be useful and different strategies 
for utilizing distributed resources in the next section. 



Advanced Compute/Data Placement 

The aim of this section is to show how system-level and
application-level data management based on Pilot-Data can
be combined to efficiently manage data and compute in
distributed, heterogeneous environments. Infrastructures
differ significantly in the way they manage data and compute;
for example, on XSEDE [45] it is generally possible to place
data on the distributed filesystem mounted to all compute
nodes. On OSG this is not readily possible since users
generally cannot access compute nodes without Condor;
however, the iRODS service on OSG enables the application
to push data to the different OSG resources. These different
kinds of semantics increase the complexity of applications.
The Pilot-API and the affinity framework provide
applications a tool for mapping different system semantics, e. g.
system-level replication vs. no replication support, to a
unified, logical resource topology, which allows applications to
reason about trade-offs and pursue different compute/data
placement strategies as laid out in section [3], e. g. bringing
compute to data (C2D) versus data to compute (D2C).

Table 1: Pilot-Compute/Pilot-Data Configurations:
We run a total of 8 CUs of BWA each processing
8.3 GB data (reference genome & read file). We
investigate different Pilot-Compute and Pilot-Data
placements on OSG and XSEDE.

For the purpose of this experiment we utilize BWA to 
process 2 GB of sequence read files using chunks of 256 MB 
and 8 CUs. Table [1] summarizes the Pilot-Compute/Pilot-
Data configurations used. In scenario 1, we spawn 8 Pilots
on OSG (in HTC environments such as OSG, a Pilot can 
only marshal a single core). The input data is placed in 
a Pilot-Data on the Engage submission node, i. e. for each 
CU a remote data transfer between the OSG worker node 
and the Engage submission node is necessary. Scenario 2 
utilizes the same Pilot-Compute setup, but optimizes data 
placements by using iRODS, which is used to make the input 
dataset directly accessible at the compute node. Scenario 3 
consists of resources located on two different infrastructures: 
OSG and Lonestar/XSEDE. The input data set resides on 
a PD on Lonestar; further, we submit 1 PC on Lonestar 
marshaling 1 node with 12 cores and 4 PCs on OSG. Finally, 
in scenario 4, both the PC and PD are located on Lonestar. 

In all scenarios, the input data set consisting of the index
files and the read file is grouped in a DU. In scenarios 1, 3
and 4 the DU is placed in a non-replicated, SSH-based
Pilot-Data (case A in our model, see section [3]), while in scenario
2 we utilize the system-level replication support of iRODS
(case B in our model). The OSG Pilot-Computes are
submitted using the SAGA-Python Condor adaptor [41] and
GlideinWMS [46], a higher-level workload management
system built on top of the Pilot capabilities of Condor-G/Glide-in
[19]. We restrict OSG resources to a set of 9 machines,
which are supported by the OSG iRODS installation. The
resources are distributed across the eastern and central US
including e. g. resources at TACC, Purdue and Cornell. We
explore different placement strategies: in scenario 1, the PD
is located on a fixed resource and the Pilot-Computes are
submitted to OSG with a locality constraint (D2C). In
scenarios 2 and 4, in contrast, the Pilot-Compute resources are
constrained to the location of the data (C2D). Scenario 3
represents a hybrid scenario: we allocate a Pilot-Compute
close to the data on Lonestar and additional Pilot-Computes
on OSG; the OSG Pilot-Computes are geographically
distributed with respect to the Pilot-Data. CUs are bound late
to the first Pilot-Compute with a free slot.




Figure 10: Pilot-based Genome Sequencing on
XSEDE and OSG: Runtimes for running BWA on
2 GB of sequence read files using 8 CUs and different
infrastructure configurations. (x-axis: Scenario,
including 2. OSG/iRODS, 3. Lonestar/OSG/SSH and
4. Lonestar/SSH; legend: Pilot Queue Tq_pilot, CU Queue
Tq_cu, CU Compute Tc, CU Staging Ts_cu, Runtime T)



Figure [10] shows the runtime of the different scenarios,
consisting of Tq, Tc and Ts. The inset describes Td, i. e. the
time for uploading, inserting and, in the case of iRODS,
replicating 8.3 GB of input data. In general, the Pilot queuing
times, i. e. the time until a Pilot becomes active, are higher
on OSG than on XSEDE. The queuing time mainly
depends on two factors: the current utilization of the resource
and the overhead induced by the queuing system. Thus, the
higher average queuing time in the OSG scenarios can be
explained mainly by the GlideinWMS system, which must
spawn the requested Condor Glide-Ins, as well as by the fact
that for each core a separate Pilot needs to be spawned. That
means that for the same scenario 8 Pilots are started on OSG
in contrast to 1 on XSEDE.

Another limiting factor in scenario 1 is the necessity to
pull in the data remotely. Since OSG resources are remote
to the data, the data transfer becomes a bottleneck: for
each CU 8.3 GB of data need to be transferred. In
scenario 2 we utilize the data/compute co-location capabilities
of the OSG Condor and iRODS installation. The runtime
T is significantly improved, mainly due to the elimination of
data transfers. However, the upfront costs for creating the
PD and replicating the data across OSG are very high:
Td_iRODS is ~1,418 sec, whereas Td_SSH is only ~338 sec for
Engage, but does not have a replication component (see inset
of Figure 10). Thus, Ts_cu is significantly higher for SSH than
for iRODS. However, even after including Td, the performance
of iRODS is 30 % better than for the SSH scenario.

Scenario 3 investigates the ability to spawn both PCs and
Pilot-Data across multiple infrastructures. Since the
input data resides in a PD on Lonestar, the staging time 
for CUs on Lonestar is significantly reduced. Since the Pi- 
lot queuing time on Lonestar was smaller than on OSG, 
the majority of the CUs have been executed on Lonestar; 
on average 4.5 out of the 8 CUs are run on Lonestar. Fi- 
nally, scenario 4 shows that if sufficient compute resources 
are available close to the data it is beneficial to execute CUs 
close to the data. 

In summary, Pilot-Abstractions enable the interoperable
use of different infrastructures as well as the use of different
strategies for allocating distributed data and compute
resources. The Pilot-Abstractions enable applications to map
system-level capabilities, e. g. OSG specifics with respect to
compute handling and the iRODS-based replication system,
to a unified, logical resource topology, which enables the
application to reason about trade-offs, such as Tq vs. Td for
a given amount of compute and data, in order to achieve
the best possible performance. While compute-bound
applications usually allow the usage of simple heuristics for
predicting Tq on different resources, data-intensive
applications require further considerations. Moving data has a
significant impact on the overall performance. In particular
in the above experiment, where we observed very low
Pilot queuing times, bringing compute to data shows a better
performance than the other way around.
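
The Tq versus Td trade-off can be illustrated with a minimal decision sketch. It assumes that a placement option can be scored by the sum of an estimated queuing time, data movement time and compute time; the function names and all numbers are made up for illustration and are not results from the experiments above.

```python
def estimated_time(option):
    """Tq + Td + Tc for one placement option (all values in seconds)."""
    return option["t_q"] + option["t_d"] + option["t_c"]


def choose_placement(options):
    """Return the name of the option with the smallest estimated time."""
    return min(options, key=lambda name: estimated_time(options[name]))


# Made-up estimates for illustration only.
options = {
    "C2D": {"t_q": 120.0, "t_d": 0.0, "t_c": 600.0},    # wait near the data
    "D2C": {"t_q": 30.0, "t_d": 400.0, "t_c": 600.0},   # move data to free cores
}
print(choose_placement(options))
```

With a low queuing time near the data, C2D wins in this sketch; a long queue at the data's location would shift the balance towards moving the data.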

7. DISCUSSION AND FUTURE WORK 

The P* Model suggests new and enhanced usage modes
for Pilot-Abstractions. For many of these new modes, as
well as existing Pilot-Job usage scenarios, effective data
co-placement is, or soon becomes, the barrier. Thus, at the
most basic level, Pilot-Data is an associate/adjunct to
Pilot-Jobs and enhances the utility and usability of Pilot-Jobs.
The advantage of Pilot-Data, however, extends well beyond
that; through a series of experiments that cover a range of
often realized distributed configurations and scenarios, e. g.
bringing compute to data and vice versa, we have seen how
Pilot-Data provides a powerful abstraction for distributed
data. Pilot-Data combined with Pilot-Jobs provides a unified
Pilot-Abstraction, which, taken collectively, enables effective
management and co-placement of computational tasks and
associated data in heterogeneous, distributed environments.

Our focus has been to establish the impact of
Pilot-Abstractions on multiple production DCI. These DCI
expose different interfaces, middleware and tools; Pilot-Data
hides the semantic differences of these backends and
provides a unified abstraction to this vast but often inconsistent
data-cyberinfrastructure, thereby reinforcing the
interoperable nature of our Pilot-Abstraction implementations.

Our discussion of affinities suggested that they are a good
abstraction for capturing relationships between computational
tasks and associated data, as well as for helping to map these
dependencies to Pilots. Affinities can also be used to map
Pilots to resources; in general, a multi-level affinity model can
be used to enhance multi-level scheduling capabilities, a
feature often associated with Pilot-Jobs. Not surprisingly, the
Pilot-Abstraction supports the concept of affinity; in fact, an
arbitrarily sophisticated model of affinity can be employed
in conjunction with the Pilot-Abstraction. By mapping
Pilots to resource affinities, applications can reason about
performance trade-offs associated with different CU/DU
placements. In general, a non-trivial characterization of
resources and applications allows for more effective
scheduling. We will explore some of these themes in future work.
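
As a minimal sketch of how an affinity model could guide placement, the example below assumes that each Pilot and each DU carries a hierarchical label and that affinity is the number of leading label components they share. The label format, the function names `affinity` and `select_pilot`, and the example labels are all hypothetical; this is one simple model among the arbitrarily sophisticated ones the Pilot-Abstraction could accommodate, not the implementation evaluated here.

```python
def affinity(label_a: str, label_b: str) -> int:
    """Number of leading path components the two labels have in common."""
    shared = 0
    for a, b in zip(label_a.split("/"), label_b.split("/")):
        if a != b:
            break
        shared += 1
    return shared


def select_pilot(pilots: dict, data_label: str) -> str:
    """Return the name of the Pilot whose label is closest to the data."""
    return max(pilots, key=lambda name: affinity(pilots[name], data_label))


# Hypothetical Pilots, labelled region/site/resource.
pilots = {
    "pilot-a": "us/site-1/cluster-x",
    "pilot-b": "us/site-2/cluster-y",
    "pilot-c": "eu/site-3/cluster-z",
}

# Hypothetical label of the DU that a CU needs as input.
du_label = "us/site-2/storage-1"

print(select_pilot(pilots, du_label))
```

Here the Pilot at the same site as the DU shares two label components with it and is selected. A scheduler could combine such a score with queuing and transfer estimates rather than using it alone.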

Pilots represent an excellent basis for building higher-level
capabilities (e.g. a workload management system or a
workflow engine) on distributed infrastructures. Using
Pilot-Abstractions, higher-level frameworks can request resources
(Pilots), on which they can then run tasks (Compute-Units).
In fact, we believe Pilots are powerful enough abstractions
that they can serve as the run-time capabilities at the
interface of applications and resources. Along with a
general-purpose model for affinity, examining how Pilot-Abstractions
can serve to build higher-level capabilities and middleware
for distributed systems is an avenue that we will explore in
the near future.